We present N-Cloth, a novel mesh-based learning approach for plausible 3D cloth deformation prediction. Our approach is general and can handle cloth or obstacles represented by triangle meshes with arbitrary topologies. We use graph convolution to transform the cloth and object meshes into a latent space, reducing the nonlinearity of the mesh space. Our network predicts the target 3D cloth mesh deformation from the initial cloth mesh template and the state of the target obstacle mesh. Our approach can handle complex cloth meshes with up to 100K triangles, and scenes with various objects corresponding to SMPL humans, non-SMPL humans, or rigid bodies. In practice, our approach exhibits good temporal coherence between successive input frames and can be used to generate plausible cloth simulations at 30-45 fps on an NVIDIA GeForce RTX 3090 GPU. We highlight its benefits over prior learning-based methods and physically-based cloth simulators.
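The core encoding step above, transforming a triangle mesh into a latent space with graph convolution, can be illustrated with a minimal sketch. This is not the authors' network: it is a single generic graph-convolution layer (row-normalized neighbor aggregation followed by a learned projection) applied to a toy tetrahedron mesh; the function name, sizes, and activation are illustrative assumptions.

```python
import numpy as np

def graph_conv(X, edges, W):
    # One generic graph-convolution layer over a mesh: build an adjacency
    # matrix with self-loops from the edge list, row-normalize it, aggregate
    # neighbor features, then project with the weight matrix W.
    n = X.shape[0]
    A = np.eye(n)                      # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    A_norm = A / A.sum(axis=1, keepdims=True)  # row-normalized adjacency
    return np.tanh(A_norm @ X @ W)

# Toy tetrahedron mesh: 4 vertices (3-D positions) mapped to a 2-D latent.
X = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
rng = np.random.default_rng(0)
Z = graph_conv(X, edges, rng.normal(size=(3, 2)))  # per-vertex latent codes
```

Stacking several such layers (as graph-convolution encoders typically do) would produce the smoother latent representation the abstract relies on.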
To address the trade-off between the quality and diversity of images generated for imbalanced classification tasks, we study oversampling-based methods at the feature level rather than the data level, and focus on searching the latent feature space for the optimal distribution. On this basis, we propose an improved estimation-of-distribution algorithm based on latent feature distribution evolution (MEDA_LUDE), in which a joint learning procedure lets the deep neural network and the evolutionary algorithm optimize and evolve separately. We explore the effect of the large-margin Gaussian mixture (L-GM) loss function on distribution learning, and design a specialized fitness function based on the similarity between samples to increase diversity. Extensive experiments on benchmark imbalanced datasets validate the effectiveness of our proposed algorithm, which can generate images with both quality and diversity. Furthermore, the MEDA_LUDE algorithm is also applied in an industrial setting, where it successfully alleviates the imbalance problem in fabric defect classification.
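A similarity-based fitness function for diversity, as the abstract describes, can be sketched in the simplest possible form: score a candidate batch of latent features by the mean pairwise distance between its samples, so that collapsed (mutually similar) batches receive low fitness. The paper's actual fitness function is more specialized; this is only a hypothetical stand-in showing the principle.

```python
import numpy as np

def diversity_fitness(latent_batch):
    # Mean pairwise Euclidean distance over a batch of latent features:
    # identical samples score 0, mutually distinct samples score higher.
    diffs = latent_batch[:, None, :] - latent_batch[None, :, :]
    pairwise = np.sqrt((diffs ** 2).sum(-1))
    n = len(latent_batch)
    return pairwise.sum() / (n * (n - 1))  # average over ordered pairs

collapsed = np.zeros((4, 8))  # four identical latent samples -> no diversity
spread = np.eye(4, 8)         # four mutually distinct latent samples
low, high = diversity_fitness(collapsed), diversity_fitness(spread)
```

An evolutionary algorithm maximizing such a fitness would push the searched latent distribution away from mode collapse, complementing the quality term learned by the network.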
We propose Human-centered 4D Scene Capture (HSC4D) to accurately and efficiently create a dynamic digital world containing large-scale indoor and outdoor scenes, diverse human motions, and rich interactions between humans and environments. Using only body-mounted IMUs and a LiDAR, HSC4D is space-free, without any external-device constraints, and map-free, without pre-built maps. Considering that IMUs can capture human poses but always drift over long-term use, while LiDAR is stable for global localization but rough for local positions and orientations, HSC4D makes the two sensors complement each other via joint optimization and achieves promising results for long-term capture. The relationship between humans and environments is also explored to make their interactions more realistic. To facilitate many downstream tasks, such as AR, VR, robotics, and autonomous driving, we propose a dataset containing three large scenes (1k-5k m^2) with accurate dynamic human motions and locations. Diverse scenes (a climbing gym, a multi-story building, a slope, etc.) and challenging human activities (exercising, walking up and down stairs, climbing, etc.) demonstrate the effectiveness and generalization ability of HSC4D. The dataset and code are available at http://www.lidarhumanmotion.net/hsc4d/.
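The complementary idea, dense but drifting IMU motion corrected by sparse but globally stable LiDAR localization, can be shown in a deliberately tiny form. HSC4D's joint optimization is far richer; this sketch only fits a linear per-step drift rate to the IMU/LiDAR disagreement at a few LiDAR fixes (by least squares) and subtracts it, with all names and the drift model being illustrative assumptions.

```python
import numpy as np

def correct_drift(imu_traj, lidar_idx, lidar_pos):
    # Observed drift = IMU trajectory minus LiDAR global fixes at sparse steps.
    t = np.asarray(lidar_idx, dtype=float)
    err = imu_traj[lidar_idx] - lidar_pos
    # Least-squares fit of err ~ t * rate (linear drift model), per axis.
    rate = (t[:, None] * err).sum(0) / (t ** 2).sum()
    steps = np.arange(len(imu_traj), dtype=float)[:, None]
    return imu_traj - steps * rate  # drift-corrected trajectory

# Toy 2-D walk: the IMU integrates the motion but drifts linearly.
true_traj = np.stack([np.linspace(0, 9, 10), np.zeros(10)], axis=1)
drift = np.arange(10)[:, None] * np.array([0.1, 0.05])
imu = true_traj + drift
fixed = correct_drift(imu, [3, 6, 9], true_traj[[3, 6, 9]])
```

Three sparse global fixes suffice here to recover the full dense trajectory, which is the intuition behind letting the two sensors compensate for each other's weaknesses.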
The attention mechanism has become the de facto module in scene text recognition (STR) methods, owing to its capability of extracting character-level representations. These methods can be grouped into implicit-attention-based and supervised-attention-based, depending on how the attention is computed: implicit attention is learned from sequence-level text annotations, while supervised attention is learned from character-level bounding-box annotations. Implicit attention may extract coarse or even incorrect spatial regions as character attention, and is prone to the alignment-drift problem. Supervised attention can alleviate this issue, but it is character-category-specific, requires extra laborious character-level bounding-box annotations, and is memory-intensive when the number of character categories is large. To address these issues, we propose a novel attention mechanism for STR: self-supervised implicit glyph attention (SIGA). SIGA delineates the glyph structures of text images through jointly self-supervised text segmentation and implicit attention alignment, which serve as supervision to improve attention correctness without extra character-level annotations. Experimental results demonstrate that SIGA performs consistently better than previous attention-based STR methods, in terms of both attention correctness and final recognition performance, on publicly available context benchmarks and on our contributed contextless benchmarks.
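To make the supervision idea concrete, here is a hypothetical loss in the spirit of glyph-guided attention: given a self-derived foreground (glyph) mask and a rough per-character column span from implicit alignment, penalize attention mass that falls outside each character's glyph region. SIGA's actual formulation differs; the function, span representation, and penalty form are all assumptions for illustration.

```python
import numpy as np

def glyph_attention_loss(attn_maps, glyph_mask, char_spans):
    # For each character: restrict the glyph mask to that character's column
    # span, then penalize attention probability mass outside that region.
    loss = 0.0
    for attn, (lo, hi) in zip(attn_maps, char_spans):
        region = np.zeros_like(glyph_mask)
        region[:, lo:hi] = glyph_mask[:, lo:hi]   # this character's glyph
        loss += 1.0 - (attn * region).sum()       # mass outside the glyph
    return loss / len(attn_maps)

mask = np.zeros((4, 8)); mask[1:3, :] = 1.0       # glyph foreground rows
good = np.zeros((4, 8)); good[1:3, 0:4] = 1 / 8   # attention on 1st character
bad = np.zeros((4, 8)); bad[1:3, 4:8] = 1 / 8     # attention on wrong columns
lo_loss = glyph_attention_loss([good], mask, [(0, 4)])
hi_loss = glyph_attention_loss([bad], mask, [(0, 4)])
```

Well-aligned attention incurs a lower loss than drifted attention, which is the sense in which the self-derived glyph structure can correct implicit attention without bounding-box labels.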
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
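The implicit-alignment idea, encoding 3D point coordinates directly into the token features of each modality, can be illustrated with a generic sinusoidal coordinate encoding. This is not CMT's exact position-encoding scheme; the function, frequency bands, and dimensions are illustrative assumptions.

```python
import numpy as np

def coord_encoding(points, dim):
    # Sinusoidal encoding of 3-D coordinates: each axis is scaled by a set
    # of frequency bands and passed through sin/cos, then flattened into a
    # feature that can be added to image or point-cloud tokens.
    freqs = 2.0 ** np.arange(dim // 6)       # dim//6 frequency bands per axis
    scaled = points[:, :, None] * freqs      # (N, 3, dim//6)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(len(points), -1)      # (N, 3 * 2 * (dim//6)) = (N, dim)

pts = np.random.default_rng(1).uniform(-1, 1, size=(5, 3))
tokens = coord_encoding(pts, dim=12)  # shared 3-D positional features
```

Because both modalities receive features derived from the same 3D coordinate frame, the transformer can associate an image token with the point-cloud tokens near the same location without any explicit view transformation.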
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
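The simpler of the two attacks, NAIVEATTACK, amounts to poisoning the raw data before distillation begins. The sketch below is a generic patch-trigger poisoner of the kind the abstract describes, not the paper's exact implementation; the function name, trigger shape, and poison ratio are illustrative assumptions.

```python
import numpy as np

def stamp_trigger(images, labels, target_label, ratio=0.1, seed=0):
    # Poison a fraction of the raw training images before distillation:
    # stamp a small bright patch in the bottom-right corner and relabel
    # the poisoned samples to the attacker's target class.
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(len(images) * ratio), replace=False)
    images[idx, -3:, -3:] = 1.0          # 3x3 trigger patch
    labels[idx] = target_label
    return images, labels, idx

imgs = np.zeros((100, 8, 8))             # toy grayscale dataset
lbls = np.zeros(100, dtype=int)
p_imgs, p_lbls, idx = stamp_trigger(imgs, lbls, target_label=7)
```

DOORPING goes further by re-optimizing the trigger at every distillation iteration; the key shared point is that the poison enters the distillation pipeline itself, so the backdoor survives into the synthetic dataset and any model trained on it.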
Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) show no such ability beyond naive repetition. Evaluating generated music is a challenging task, and even more so for drum grooves, which have little precedent in the literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
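Finetuning a text-pretrained model on MIDI presupposes serializing each drum groove as a text sequence. The paper does not specify its tokenization here, so the following is a hypothetical encoding: a 16-step grid where each step lists the instruments struck, rendered as a delimiter-separated string a language model can consume.

```python
def groove_to_text(hits, instruments=("kick", "snare", "hihat")):
    # hits: list of (step, instrument_index) pairs on a 16-step grid.
    # Render each step as "+"-joined instrument names, "-" when silent,
    # and join the 16 steps with " | " into one trainable text line.
    steps = [[] for _ in range(16)]
    for step, inst in hits:
        steps[step].append(instruments[inst])
    return " | ".join("+".join(s) if s else "-" for s in steps)

# A simple rock pattern: kick+hihat on 1, snare on 2, kick on 3, snare on 4.
groove = [(0, 0), (0, 2), (4, 1), (8, 0), (12, 1)]
text = groove_to_text(groove)
```

Once grooves are flat text like this, finetuning reduces to ordinary next-token prediction on a few hundred such lines, which is why a strong text prior can transfer at all.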
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes from only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are twofold: First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking on the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
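The first insight, using support masks to form dynamic class centers that re-weight query features, is commonly realized with masked average pooling. The sketch below shows that generic operation on a toy feature map; it is an illustration of the idea, not RefT's actual module, and the re-weighting by elementwise product is an assumed simplification.

```python
import numpy as np

def masked_class_center(support_feat, support_mask):
    # Masked average pooling: normalize the binary support mask into
    # spatial weights, then pool the support feature map into one
    # C-dimensional class center covering only the object region.
    w = support_mask / (support_mask.sum() + 1e-6)
    return (support_feat * w[..., None]).sum(axis=(0, 1))

feat = np.zeros((4, 4, 3))
feat[1:3, 1:3] = np.array([1.0, 2.0, 3.0])  # object-region features
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                        # support mask over the object
center = masked_class_center(feat, mask)

query = np.ones((4, 4, 3))
reweighted = query * center                 # feature-level re-weighting
```

Because the mask excludes background pixels, the pooled center reflects only the object's features, which makes the re-weighting of query features class-specific rather than image-specific.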
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs have a large number of parameters, which makes them computationally expensive. Therefore, it is difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a lightweight model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias of the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
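To ground the problem setting, one simple way to couple distillation with a fairness objective is a composite loss: a standard distillation term plus a penalty on the gap between the student's mean prediction across two demographic groups (a demographic-parity-style surrogate). RELIANT's actual debiasing mechanism is more involved; this function and its penalty form are assumptions used only to make the tension between utility and bias concrete.

```python
import numpy as np

def fair_kd_loss(student_out, teacher_out, group, lam=1.0):
    # Distillation term: mean squared error to the teacher's outputs.
    kd = ((student_out - teacher_out) ** 2).mean()
    # Fairness surrogate: gap between mean predictions of the two groups.
    gap = abs(student_out[group == 0].mean() - student_out[group == 1].mean())
    return kd + lam * gap

t = np.array([1.0, 0.9, 0.1, 0.0])   # teacher scores (biased by group)
g = np.array([0, 0, 1, 1])           # sensitive group membership
mimic = fair_kd_loss(np.array([0.9, 0.8, 0.2, 0.1]), t, g)    # copies bias
neutral = fair_kd_loss(np.array([0.5, 0.5, 0.5, 0.5]), t, g)  # ignores group
```

A student that faithfully mimics this biased teacher pays the full fairness penalty, illustrating why naive distillation inherits (and a fairness-aware objective must counteract) the teacher's bias.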
This paper focuses on designing efficient models with low parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of the Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even though the same framework is shared. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 accuracy, surpassing \textbf{SoTA} CNN-/Transformer-based models, while trading off model accuracy and efficiency well.
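The hybrid recipe, inverted-residual expansion plus cheap local mixing plus attention-style global mixing, can be caricatured in a few lines. This is a heavily simplified, hypothetical block, not the real iRMB: the local mixing is a 3-tap moving average standing in for a depthwise convolution, and the global mixing is plain single-head self-attention over tokens.

```python
import numpy as np

def irmb_sketch(x, expand=2):
    # Inverted-residual skeleton: expand channels, mix locally (CNN-like),
    # mix globally with self-attention (Transformer-like), project back,
    # and add the residual.
    n, c = x.shape
    rng = np.random.default_rng(0)
    w_up = rng.normal(size=(c, c * expand)) / np.sqrt(c)
    w_down = rng.normal(size=(c * expand, c)) / np.sqrt(c * expand)
    h = np.maximum(x @ w_up, 0)                              # expansion + ReLU
    local = (np.roll(h, 1, 0) + h + np.roll(h, -1, 0)) / 3   # 3-tap local mix
    attn = local @ local.T / np.sqrt(local.shape[1])         # token affinities
    attn = np.exp(attn - attn.max(1, keepdims=True))
    attn /= attn.sum(1, keepdims=True)                       # softmax rows
    return x + (attn @ local) @ w_down                       # residual output

tokens = np.random.default_rng(1).normal(size=(6, 4))
out = irmb_sketch(tokens)
```

Even in this toy form, the structure makes the abstract's claim tangible: short-distance dependency is handled by the cheap local operator inside the expansion, while long-distance interaction comes from the attention over all tokens.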